Abstract

We investigate whether one can determine 
from the transcripts of U.S. Congressional 
floor debates whether the speeches repre- 
sent support of or opposition to proposed 
legislation. To address this problem, we 
exploit the fact that these speeches occur 
as part of a discussion; this allows us to 
use sources of information regarding re- 
lationships between discourse segments, 
such as whether a given utterance indicates 
agreement with the opinion expressed by 
another. We find that the incorporation 
of such information yields substantial im- 
provements over classifying speeches in 
isolation. 

1 Introduction 

One ought to recognize that the present 
political chaos is connected with the de- 
cay of language, and that one can prob- 
ably bring about some improvement by 
starting at the verbal end. — Orwell, 
"Politics and the English language" 

We have entered an era where very large 
amounts of politically oriented text are now avail- 
able online. This includes both official documents, 
such as the full text of laws and the proceedings of 
legislative bodies, and unofficial documents, such 
as postings on weblogs (blogs) devoted to politics. 
In some sense, the availability of such data is simply a manifestation
of a general trend of "everybody putting their records on the
Internet".^1 The online accessibility of politically oriented texts in
particular, however, is a phenomenon that some have gone so far as to
say will have a potentially society-changing effect.

In the United States, for example, governmental bodies are providing
and soliciting political documents via the Internet, with lofty goals
in mind: electronic rulemaking (eRulemaking) initiatives involving the
"electronic collection, distribution, synthesis, and analysis of public
commentary in the regulatory rulemaking process" may "[alter] the
citizen-government relationship" (Shulman and Schlosberg, 2002).
Additionally, much media attention has been focused recently on the
potential impact that Internet sites may have on politics^2, or at
least on political journalism^3. Regardless of whether one views such
claims as clear-sighted prophecy or mere hype, it is obviously
important to help people understand and analyze politically oriented
text, given the importance of enabling informed participation in the
political process.

Evaluative and persuasive documents, such as a politician's speech
regarding a bill or a blogger's commentary on a legislative proposal,
form a particularly interesting type of politically oriented text.
People are much more likely to consult such evaluative statements than
the actual text of a bill or law under discussion, given the dense
nature of legislative language and the fact that (U.S.) bills often
reach several hundred pages in length (Smith, Roberts, and Vander
Wielen, 2005). Moreover, political opinions are explicitly solicited in
the eRulemaking scenario.

^1 It is worth pointing out that the United States' Library of Congress
was an extremely early adopter of Web technology: the THOMAS database
(http://thomas.loc.gov) of congressional bills and related data was
launched in January 1995, when Mosaic was not quite two years old and
Altavista did not yet exist.

^2 E.g., "Internet injects sweeping change into U.S. politics", Adam
Nagourney, The New York Times, April 2, 2006.

^3 E.g., "The End of News?", Michael Massing, The New York Review of
Books, December 1, 2005.

In the analysis of evaluative language, it is fun- 
damentally necessary to determine whether the au- 
thor/speaker supports or disapproves of the topic 
of discussion. In this paper, we investigate the 
following specific instantiation of this problem: 
we seek to determine from the transcripts of 
U.S. Congressional floor debates whether each 
"speech" (continuous single-speaker segment of 
text) represents support for or opposition to a pro- 
posed piece of legislation. Note that from an ex- 
perimental point of view, this is a very convenient 
problem to work with because we can automati- 
cally determine ground truth (and thus avoid the 
need for manual annotation) simply by consulting 
publicly available voting records. 

Task properties Determining whether or not a speaker supports a
proposal falls within the realm of sentiment analysis, an extremely
active research area devoted to the computational treatment of
subjective or opinion-oriented language (early work includes Wiebe and
Rapaport (1988), Hearst (1992), Sack (1994), and Wiebe (1994); see
Esuli (2006) for an active bibliography). In particular, since we treat
each individual speech within a debate as a single "document", we are
considering a version of document-level sentiment-polarity
classification, namely, automatically distinguishing between positive
and negative documents (Das and Chen, 2001; Pang, Lee, and
Vaithyanathan, 2002; Turney, 2002; Dave, Lawrence, and Pennock, 2003).

Most sentiment-polarity classifiers proposed in the recent literature
categorize each document independently. A few others incorporate
various measures of inter-document similarity between the texts to be
labeled (Agarwal and Bhattacharyya, 2005; Pang and Lee, 2005; Goldberg
and Zhu, 2006). Many interesting opinion-oriented documents, however,
can be linked through certain relationships that occur in the context
of evaluative discussions. For example, we may find textual^4 evidence
of a high likelihood of agreement between two speakers, such as
explicit assertions ("I second that!") or quotation of messages in
emails or postings (see Mullen and Malouf (2006) but cf. Agrawal et al.
(2003)). Agreement evidence can be a powerful aid in our classification
task: for example, we can easily categorize a complicated (or overly
terse) document if we find within it indications of agreement with a
clearly positive text.

^4 Because we are most interested in techniques applicable across
domains, we restrict consideration to NLP aspects of the problem,
ignoring external problem-specific information. For example, although
most votes in our corpus were almost completely along party lines (and
despite the fact that same-party information is easily incorporated via
the methods we propose), we did not use party-affiliation data. Indeed,
in other settings (e.g., a movie-discussion listserv) one may not be
able to determine the participants' political leanings, and such
information may not lead to significantly improved results even if it
were available.

Obviously, incorporating agreement information provides additional
benefit only when the input documents are relatively difficult to
classify individually. Intuition suggests that this is true of the data
with which we experiment, for several reasons. First, U.S.
congressional debates contain very rich language and cover an extremely
wide variety of topics, ranging from flag burning to international
policy to the federal budget. Debates are also subject to digressions,
some fairly natural and others less so (e.g., "Why are we discussing
this bill when the plight of my constituents regarding this other issue
is being ignored?").

Second, an important characteristic of persuasive language is that
speakers may spend more time presenting evidence in support of their
positions (or attacking the evidence presented by others) than directly
stating their attitudes. An extreme example will illustrate the
problems involved. Consider a speech that describes the U.S. flag as
deeply inspirational, and thus contains only positive language. If the
bill under discussion is a proposed flag-burning ban, then the speech
is supportive; but if the bill under discussion is aimed at rescinding
an existing flag-burning ban, the speech may represent opposition to
the legislation. Given the current state of the art in sentiment
analysis, it is doubtful that one could determine the (probably
topic-specific) relationship between presented evidence and speaker
opinion.

Qualitative summary of results The above difficulties underscore the
importance of enhancing standard classification techniques with new
information sources that promise to improve accuracy, such as
inter-document relationships between the documents to be labeled. In
this paper, we demonstrate that the incorporation of agreement modeling
can provide substantial improvements over the
application of support vector machines (SVMs) in
isolation, which represents the state of the art in the individual
classification of documents. The enhanced accuracies are obtained via a
fairly primitive automatically-acquired "agreement detector" and a
conceptually simple method for integrating isolated-document and
agreement-based information. We thus view our results as demonstrating
the potentially large benefits of exploiting sentiment-related
discourse-segment relationships in sentiment-analysis tasks.

                                               total   train   test   development
speech segments                                 3857    2740    860           257
debates                                           53      38     10             5
average number of speech segments per debate    72.8    72.1   86.0          51.4
average number of speakers per debate           32.1    30.9   41.1          22.6

Table 1: Corpus statistics.

2 Corpus 

This section outlines the main steps of the process 
by which we created our corpus (download site: 
www.cs.cornell.edu/home/llee/data/convote.html).

GovTrack (http://govtrack.us) is an independent 
website run by Joshua Tauberer that collects pub- 
licly available data on the legislative and fund- 
raising activities of U.S. congresspeople. Due to 
its extensive cross-referencing and collating of in- 
formation, it was nominated for a 2006 "Webby" 
award. A crucial characteristic of GovTrack from 
our point of view is that the information is pro- 
vided in a very convenient format; for instance, 
the floor-debate transcripts are broken into sepa- 
rate HTML files according to the subject of the 
debate, so we can trivially derive long sequences 
of speeches guaranteed to cover the same topic. 

We extracted from GovTrack all available tran- 
scripts of U.S. floor debates in the House of Rep- 
resentatives for the year 2005 (3268 pages of tran- 
scripts in total), together with voting records for all 
roll-call votes during that year. We concentrated 
on debates regarding "controversial" bills (ones in 
which the losing side generated at least 20% of the 
speeches) because these debates should presum- 
ably exhibit more interesting discourse structure. 
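The "controversial" criterion can be sketched as a simple filter. This is a minimal illustration under stated assumptions, not the script used to build the corpus: the function name `is_controversial`, the list-of-labels input, and the `losing_side` argument are all hypothetical.

```python
def is_controversial(speech_votes, losing_side, min_share=0.20):
    """Return True if the losing side produced at least `min_share` of the speeches.

    speech_votes: one "yea"/"nay" label per speech in the debate (the vote
                  cast by the person who gave the speech).
    losing_side:  the label of the side that lost the roll-call vote.
    Both inputs are assumptions of this sketch.
    """
    if not speech_votes:
        return False
    losing_speeches = sum(1 for vote in speech_votes if vote == losing_side)
    return losing_speeches / len(speech_votes) >= min_share


# Hypothetical debate in which 3 of 10 speeches come from the losing side.
debate = ["yea"] * 7 + ["nay"] * 3
print(is_controversial(debate, losing_side="nay"))  # True
```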

Each debate consists of a series of speech seg- 
ments, where each segment is a sequence of un- 
interrupted utterances by a single speaker. Since 
speech segments represent natural discourse units, 
we treat them as the basic unit to be classified. 



Each speech segment was labeled by the vote 
("yea" or "nay") cast for the proposed bill by the 
person who uttered the speech segment. 

We automatically discarded those speech seg- 
ments belonging to a class of formulaic, generally 
one-sentence utterances focused on the yielding 
of time on the House floor (for example, "Madam
Speaker, I am pleased to yield 5 minutes to the 
gentleman from Massachusetts"), as such speech 
segments are clearly off-topic. We also removed 
speech segments containing the term "amend- 
ment", since we found during initial inspection 
that these speeches generally reflect a speaker's 
opinion on an amendment, and this opinion may 
differ from the speaker's opinion on the underly- 
ing bill under discussion. 
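The two discarding rules above might be approximated as follows. The text does not give the exact rule used to detect time-yielding utterances, so the regular expression and the naive sentence splitter below are assumptions made purely for illustration.

```python
import re

# Hypothetical pattern for time-yielding boilerplate (an assumption of this sketch).
YIELD_PATTERN = re.compile(r"\byield(?:s|ed|ing)?\b", re.IGNORECASE)
# Naive sentence boundary: terminal punctuation followed by whitespace or end of text.
SENTENCE_END = re.compile(r"[.!?]+(?:\s+|$)")


def keep_segment(text):
    """Return False for segments that the two filtering rules would discard."""
    # Rule 2: drop anything mentioning the term "amendment".
    if "amendment" in text.lower():
        return False
    # Rule 1: drop formulaic, one-sentence utterances about yielding time.
    sentences = [s for s in SENTENCE_END.split(text.strip()) if s.strip()]
    if len(sentences) <= 1 and YIELD_PATTERN.search(text) is not None:
        return False
    return True


print(keep_segment("Madam Speaker, I am pleased to yield 5 minutes "
                   "to the gentleman from Massachusetts."))  # False
```

A splitter this simple miscounts sentences containing abbreviations such as "U.S."; a real pipeline would need something more careful.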

We randomly split the data into training, test, 
and development (parameter-tuning) sets repre- 
senting roughly 70%, 20%, and 10% of our data, 
respectively (see Table 1). The speech segments
remained grouped by debate, with 38 debates as- 
signed to the training set, 10 to the test set, and 5 
to the development set; we require that the speech 
segments from an individual debate all appear in 
the same set because our goal is to examine clas- 
sification of speech segments in the context of the 
surrounding discussion. 
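A debate-grouped split of this kind can be sketched as below. The function name, the `(debate_id, text)` input format, and the choice to specify the split as debate counts are assumptions of the sketch; the text states only the resulting proportions and debate counts.

```python
import random


def split_by_debate(segments, n_test, n_dev, seed=0):
    """Split segments into train/test/dev so that no debate is divided.

    segments: list of (debate_id, text) pairs (hypothetical input format).
    n_test, n_dev: number of whole debates assigned to the test and dev sets.
    """
    debate_ids = sorted({debate_id for debate_id, _ in segments})
    random.Random(seed).shuffle(debate_ids)
    test_ids = set(debate_ids[:n_test])
    dev_ids = set(debate_ids[n_test:n_test + n_dev])

    splits = {"train": [], "test": [], "dev": []}
    for debate_id, text in segments:
        if debate_id in test_ids:
            splits["test"].append((debate_id, text))
        elif debate_id in dev_ids:
            splits["dev"].append((debate_id, text))
        else:
            splits["train"].append((debate_id, text))
    return splits


# Toy corpus: 53 debates with 3 segments each.
toy = [(d, f"speech {i}") for d in range(53) for i in range(3)]
parts = split_by_debate(toy, n_test=10, n_dev=5)
print({name: len({d for d, _ in segs}) for name, segs in parts.items()})
```

Shuffling debate identifiers rather than individual segments is what keeps every segment of a debate in the same set.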

3 Method 

The support/oppose classification problem can be 
approached through the use of standard classi- 
fiers such as support vector machines (SVMs), 
which consider each text unit in isolation. As 
discussed in Section 1, however, the conversational
nature of our data implies the existence
of various relationships that can be exploited to 
improve cumulative classification accuracy for 
speech segments belonging to the same debate. 
Our classification framework, directly inspired by 
Blum and Chawla (2001), integrates both perspectives,
optimizing its labeling of speech segments
based on both individual speech-segment classifi- 
cation scores and preferences for groups of speech 



segments to receive the same label. In this sec- 
tion, we discuss the specific classification frame- 
work that we adopt and the set of mechanisms that 
we propose for modeling specific types of relation- 
ships. 

3.1 Classification framework 

Let $s_1, s_2, \ldots, s_n$ be the sequence of speech segments within a
given debate, and let $\mathcal{Y}$ and $\mathcal{N}$ stand for the
"yea" and "nay" class, respectively. Assume we have a non-negative
function $ind(s, C)$ indicating the degree of preference that an
individual-document classifier, such as an SVM, has for placing
speech-segment $s$ in class $C$. Also, assume that some pairs of speech
segments have weighted links between them, where the non-negative
strength (weight) $str(\ell)$ for a link $\ell$ indicates the degree to
which it is preferable that the linked speech segments receive the same
label. Then, any class assignment $c = c(s_1), c(s_2), \ldots, c(s_n)$
can be assigned a cost

\[
\sum_{s} ind(s, \bar{c}(s)) \;+\; \sum_{s, s' :\, c(s) \neq c(s')} \;\; \sum_{\ell \text{ between } s, s'} str(\ell),
\]

where $\bar{c}(s)$ is the "opposite" class from $c(s)$. A minimum-cost
assignment thus represents an optimum way to classify the speech
segments so that each one tends not to be put into the class that the
individual-document classifier disprefers, but at the same time, highly
associated speech segments tend not to be put in different classes.
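To make the cost concrete, the following sketch evaluates it for a toy debate and finds the minimum by exhaustive search. The data structures (`ind` as nested dictionaries, `links` as triples) and all numeric values are assumptions chosen for illustration; exhaustive search is only feasible for a handful of segments.

```python
from itertools import product

OPPOSITE = {"Y": "N", "N": "Y"}


def assignment_cost(ind, links, labels):
    """Cost of a labeling: dispreferred-class penalties plus cut link strengths.

    ind:    {segment: {"Y": score, "N": score}}  (non-negative preferences)
    links:  [(segment, segment, strength)]       (non-negative strengths)
    labels: {segment: "Y" or "N"}
    """
    individual = sum(ind[s][OPPOSITE[labels[s]]] for s in ind)
    separation = sum(w for s, t, w in links if labels[s] != labels[t])
    return individual + separation


def brute_force_min(ind, links):
    """Try every labeling; return the cheapest one and its cost."""
    segments = sorted(ind)
    candidates = (dict(zip(segments, combo))
                  for combo in product("YN", repeat=len(segments)))
    best = min(candidates, key=lambda lab: assignment_cost(ind, links, lab))
    return best, assignment_cost(ind, links, best)


# Toy values: s2 leans "nay" on its own, but is linked to the strongly "yea" s1.
ind = {"s1": {"Y": 0.9, "N": 0.1},
       "s2": {"Y": 0.4, "N": 0.6},
       "s3": {"Y": 0.2, "N": 0.8}}
links = [("s1", "s2", 0.5)]
print(brute_force_min(ind, links))
```

In this toy case the link outweighs the weak individual preference of `s2`, so the minimum-cost labeling places `s2` in the same class as `s1`.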

As has been previously observed and exploited in the NLP literature
(Pang and Lee, 2004; Agarwal and Bhattacharyya, 2005; Barzilay and
Lapata, 2005), the above optimization function, unlike many others that
have been proposed for graph or set partitioning, can be solved exactly
in a provably efficient manner via methods for finding minimum cuts in
graphs.
In our view, the contribution of our work is 
the examination of new types of relationships, 
not the method by which such relationships are 
incorporated into the classification decision. 
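One standard way to realize the minimum-cut formulation is sketched below. The graph construction (a source standing for "yea", a sink standing for "nay") follows the usual reduction; the text does not specify a particular minimum-cut algorithm, so the Edmonds-Karp max-flow routine here is a choice made for this illustration, and all names and numbers are hypothetical.

```python
from collections import defaultdict, deque


def min_cut_labels(ind, links):
    """Label segments by a minimum s-t cut (Edmonds-Karp max-flow).

    ind:   {segment: {"Y": score, "N": score}}
    links: [(segment, segment, strength)]
    A segment left on the source side of the cut is labeled "Y".
    """
    src, snk = "__yea__", "__nay__"
    cap = defaultdict(lambda: defaultdict(float))
    for s, pref in ind.items():
        cap[src][s] += pref["Y"]   # cut (paid) when s is labeled "N"
        cap[s][snk] += pref["N"]   # cut (paid) when s is labeled "Y"
        cap[s][src] += 0.0         # residual (reverse) arcs
        cap[snk][s] += 0.0
    for s, t, w in links:
        cap[s][t] += w             # cut when s and t receive different labels
        cap[t][s] += w

    def bfs():
        """Return the BFS parent map over arcs with positive residual capacity."""
        parent = {src: None}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 1e-12 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    flow = 0.0
    while True:
        parent = bfs()
        if snk not in parent:
            break
        # Recover the augmenting path and push its bottleneck capacity.
        path, v = [], snk
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(cap[u][v] for u, v in path)
        for u, v in path:
            cap[u][v] -= push
            cap[v][u] += push
        flow += push

    reachable = bfs()
    labels = {s: ("Y" if s in reachable else "N") for s in ind}
    return labels, flow


ind = {"s1": {"Y": 0.9, "N": 0.1},
       "s2": {"Y": 0.4, "N": 0.6},
       "s3": {"Y": 0.2, "N": 0.8}}
print(min_cut_labels(ind, [("s1", "s2", 0.5)]))
```

By max-flow/min-cut duality, the returned flow value equals the minimum assignment cost, so this sketch and an exhaustive search should agree on small inputs.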

3.2 Classifying speech segments in isolation 

In our experiments, we employed the well-known classifier SVM^light to
obtain individual-document classification scores, treating
$\mathcal{Y}$ as the positive class and using plain unigrams as
features.^5 Following standard practice in sentiment analysis (Pang,
Lee, and Vaithyanathan, 2002), the input to SVM^light consisted of
normalized presence-of-feature (rather than frequency-of-feature)
vectors. The $ind$ value for each speech segment $s$ was based on the
signed distance $d(s)$ from the vector representing $s$ to the trained
SVM decision plane:

\[
ind(s, \mathcal{Y}) \stackrel{\mathrm{def}}{=}
\begin{cases}
1 & d(s) > 2\sigma_s; \\
\left(1 + \frac{d(s)}{2\sigma_s}\right) / 2 & |d(s)| \leq 2\sigma_s; \\
0 & d(s) < -2\sigma_s,
\end{cases}
\]

where $\sigma_s$ is the standard deviation of $d(s)$ over all speech
segments $s$ in the debate in question, and
$ind(s, \mathcal{N}) \stackrel{\mathrm{def}}{=} 1 - ind(s, \mathcal{Y})$.

^5 SVM^light is available at svmlight.joachims.org. Default parameters
were used, although experimentation with different parameter settings
is an important direction for future work (Daelemans and Hoste, 2002;
Munson, Cardie, and Caruana, 2005).

We now turn to the more interesting problem of representing the
preferences that speech segments may have for being assigned to the
same class.

3.3 Relationships between speech segments

A wide range of relationships between text segments can be modeled as
positive-strength links. Here we discuss two types of constraints that
are considered in this work.

Same-speaker constraints: In Congressional debates and in general
social-discourse contexts, a single speaker may make a number of
comments regarding a topic. It is reasonable to expect that in many
settings, the participants in a discussion may be convinced to change
their opinions midway through a debate. Hence, in the general case we
wish to be able to express "soft" preferences for all of an author's
statements to receive the same label, where the strengths of such
constraints could, for instance, vary according to the time elapsed
between the statements. Weighted links are an appropriate means to
express such variation.

However, if we assume that most speakers do not change their positions
in the course of a discussion, we can conclude that all comments made
by the same speaker must receive the same label. This assumption holds
by fiat for the ground-truth labels in our dataset because these labels
were derived from the single vote cast by the speaker on the bill being
discussed.^6 We can implement this assumption via links whose weights
are essentially infinite. Although one can also implement this
assumption via concatenation of same-speaker speech segments (see
Section 4.3), we view the fact that our graph-based framework
incorporates both hard and soft constraints in a principled fashion as
an advantage of our approach.

^6 We are attempting to determine whether a speech segment represents
support or not. This differs from the problem of determining what the
speaker's actual opinion is, a problem that, as an anonymous reviewer
put it, is complicated by "grandstanding, backroom deals, or, more
innocently, plain change of mind ('I voted for it before I voted
against it')".
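The score normalization of Section 3.2 and the hard same-speaker links described above might be sketched as follows. The function names, the dictionary-based inputs, the use of the population standard deviation, and the handling of a zero standard deviation are assumptions of this illustration rather than details given in the text.

```python
import math
from statistics import pstdev


def ind_yea(d, sigma):
    """Piecewise-linear map from a signed SVM distance d to a [0, 1] "yea" preference."""
    if d > 2 * sigma:
        return 1.0
    if d < -2 * sigma:
        return 0.0
    if sigma == 0:  # only reachable when d == 0 too; returning 0.5 is an assumption
        return 0.5
    return (1 + d / (2 * sigma)) / 2


def debate_ind(distances):
    """distances: {segment: d(s)} for ONE debate -> {segment: {"Y": ..., "N": ...}}."""
    sigma = pstdev(list(distances.values()))  # population std. dev. (assumption)
    scores = {}
    for seg, d in distances.items():
        yea = ind_yea(d, sigma)
        scores[seg] = {"Y": yea, "N": 1.0 - yea}
    return scores


def same_speaker_links(speaker_of):
    """Chain each speaker's segments together with infinite-weight links.

    speaker_of: {segment: speaker}, listed in debate order (hypothetical format).
    """
    last_segment = {}
    links = []
    for seg, speaker in speaker_of.items():
        if speaker in last_segment:
            links.append((last_segment[speaker], seg, math.inf))
        last_segment[speaker] = seg
    return links


print(debate_ind({"a1": 1.0, "b1": -1.0}))
print(same_speaker_links({"a1": "A", "b1": "B", "a2": "A"}))
```

Chaining consecutive segments needs only one link per additional segment, yet still forces all of a speaker's segments into the same class.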

Different-speaker agreements In House dis- 
course, it is common for one speaker to make ref- 
erence to another in the context of an agreement 
or disagreement over the topic of discussion. The 
systematic identification of instances of agreement 
can, as we have discussed, be a powerful tool for 
the development of intelligently selected weights 
for links between speech segments. 

The problem of agreement identification can be decomposed into two
sub-problems: identifying references and their targets, and deciding
whether each reference represents an instance of agreement. In our
case, the first task is straightforward because we focused solely on
by-name references.^7 Hence, we will now concentrate on the second,
more interesting task.

We approach the problem of classifying references by representing each
reference with a word-presence vector derived from a window of text
surrounding the reference.^8 In the training set, we classify each
reference connecting two speakers with a positive or negative label
depending on whether the two voted the same way on the bill under
discussion.^9 These labels are then used to train an SVM classifier,
the output of which is subsequently used to create weights on agreement
links in the test set as follows.

Let $d(r)$ denote the distance from the vector representing reference
$r$ to the agreement-detector SVM's decision plane, and let $\sigma_r$
be the standard deviation of $d(r)$ over all references in the debate
in question. We then define the strength $agr$ of the agreement link
corresponding to the reference as:

\[
agr(r) \stackrel{\mathrm{def}}{=}
\begin{cases}
0 & d(r) < \theta_{\mathrm{agr}}; \\
\alpha \cdot \frac{d(r)}{4\sigma_r} & \theta_{\mathrm{agr}} \leq d(r) \leq 4\sigma_r; \\
\alpha & d(r) > 4\sigma_r.
\end{cases}
\]

The free parameter $\alpha$ specifies the relative importance of the
$agr$ scores. The threshold $\theta_{\mathrm{agr}}$ controls the
precision of the agreement links, in that values of
$\theta_{\mathrm{agr}}$ greater than zero mean that greater confidence
is required before an agreement link can be added.

^7 One subtlety is that for the purposes of mining agreement cues (but
not for evaluating overall support/oppose classification accuracy), we
temporarily re-inserted into our dataset previously filtered speech
segments containing the term "yield", since the yielding of time on the
House floor typically indicates agreement even though the yield
statements contain little relevant text on their own.

^8 We found good development-set performance using the 30 tokens
before, 20 tokens after, and the name itself.

^9 Since we are concerned with references that potentially represent
relationships between speech segments, we ignore references for which
the target of the reference did not speak in the debate in which the
reference was made.

Agreement classifier                                Devel.   Test
("reference ⇒ agreement?")                          set      set

majority baseline                                   81.51    80.26
Train: no amdmts; $\theta_{\mathrm{agr}} = 0$       84.25    81.07
Train: with amdmts; $\theta_{\mathrm{agr}} = 0$     86.99    80.10

Table 2: Agreement-classifier accuracy, in percent. "Amdmts"="speech
segments containing the word 'amendment'". Recall that boldface
indicates results for development-set-optimal settings.
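The reference representation and the link-strength definition can be sketched as below. The function names are hypothetical, the sketch assumes the referenced name occupies a single token and that the threshold is non-negative, and the fallback for a zero standard deviation is likewise an assumption.

```python
def reference_window(tokens, name_index, before=30, after=20):
    """Word-presence features for a by-name reference.

    tokens:     the tokenized speech segment (list of strings).
    name_index: position of the referenced name, assumed to be a single token.
    Returns the set of lowercased tokens in the window: `before` tokens before
    the name, the name itself, and `after` tokens after it.
    """
    start = max(0, name_index - before)
    window = tokens[start:name_index + after + 1]
    return {tok.lower() for tok in window}


def agr_strength(d, sigma_r, alpha, theta=0.0):
    """Agreement-link strength from the detector's signed distance d(r).

    Assumes theta >= 0, so that the returned strength is never negative.
    """
    if d < theta:
        return 0.0
    if d > 4 * sigma_r:
        return alpha
    if sigma_r == 0:  # only reachable when d == 0; no evidence, so no link (assumption)
        return 0.0
    return alpha * d / (4 * sigma_r)


tokens = [f"w{i}" for i in range(100)]
print(len(reference_window(tokens, 50)))          # 51
print(agr_strength(2.0, sigma_r=1.0, alpha=2.0))  # 1.0
```

Raising `theta` removes low-confidence links entirely instead of merely down-weighting them, which is what makes it a precision control.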

4 Evaluation 

This section presents experiments testing the util- 
ity of using speech-segment relationships, evalu- 
ating against a number of baselines. All reported 
results use values for the free parameter $\alpha$ derived
via tuning on the development set. In the tables, 
boldface indicates the development- and test-set 
results for the development-set-optimal parameter 
settings, as one would make algorithmic choices 
based on development-set performance. 

4.1 Preliminaries: Reference classification 

Recall that to gather inter-speaker agreement information, the strategy
employed in this paper is to classify by-name references to other
speakers as to whether they indicate agreement or not.

To train our agreement classifier, we experimented with undoing the
deletion of amendment-related speech segments in the training set. Note
that such speech segments were never included in the development or
test set, since, as discussed in Section 2, their labels are probably
noisy; however, including them in the training set allows the
classifier to examine more instances even though some of them are
labeled incorrectly. As Table 2 shows, using more, if noisy, data
yields better agreement-classification results on the development set,
and so we use that policy in all subsequent experiments.^11

^10 Our implementation puts a link between just one arbitrary pair of
speech segments among all those uttered by a given pair of apparently
agreeing speakers. The "infinite-weight" same-speaker links propagate
the agreement information to all other such pairs.

An important observation is that precision may be more important than
accuracy in deciding which agreement links to add: false positives with
respect to agreement can cause speech segments to be incorrectly
assigned the same label, whereas false negatives mean only that
agreement-based information about other speech segments is not
employed. As described above, we can raise agreement precision by
increasing the threshold $\theta_{\mathrm{agr}}$, which specifies the
required confidence for the addition of an agreement link. Indeed,
Table 3 shows that we can improve agreement precision by setting
$\theta_{\mathrm{agr}}$ to the (positive) mean agreement score $\mu$
assigned by the SVM agreement-classifier over all references in the
given debate.^12 However, this comes at the cost of greatly reducing
agreement accuracy (development: 64.38%; test: 66.18%) due to lowered
recall levels. Whether or not better speech-segment classification is
ultimately achieved is discussed in the next sections.

Agreement classifier               Precision (in percent):
                                   Devel. set   Test set

$\theta_{\mathrm{agr}} = 0$        86.23        82.55
$\theta_{\mathrm{agr}} = \mu$      89.41        88.47

Table 3: Agreement-classifier precision.
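The trade-off between precision and accuracy under a raised threshold can be illustrated with a small computation. The scores and labels below are invented toy values, not data from the corpus, and the helper name `link_metrics` is hypothetical.

```python
from statistics import mean


def link_metrics(scores, same_vote, theta):
    """Precision and accuracy of agreement predictions at threshold theta.

    scores:    agreement-detector outputs d(r), one per reference.
    same_vote: True where the two speakers actually voted the same way.
    A reference is predicted to indicate agreement when d(r) >= theta.
    """
    predicted = [d >= theta for d in scores]
    true_pos = sum(p and y for p, y in zip(predicted, same_vote))
    false_pos = sum(p and not y for p, y in zip(predicted, same_vote))
    correct = sum(p == y for p, y in zip(predicted, same_vote))
    flagged = true_pos + false_pos
    precision = true_pos / flagged if flagged else float("nan")
    return precision, correct / len(scores)


# Invented toy references within one debate.
scores = [2.0, 1.5, 0.6, 0.4, 0.3, -1.0]
same_vote = [True, True, False, True, True, False]

print(link_metrics(scores, same_vote, theta=0.0))
print(link_metrics(scores, same_vote, theta=mean(scores)))
```

In this toy case, moving the threshold from 0 to the mean score removes the one false positive but also discards two correct links, so precision rises while accuracy falls.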

4.2 Segment-based speech-segment 
classification 

Baselines The first two data rows of Table 4
depict baseline performance results. The
#("support") - #("oppos") baseline is meant
to explore whether the speech-segment classifica- 
tion task can be reduced to simple lexical checks. 
Specifically, this method uses the signed differ- 
ence between the number of words containing the 
stem "support" and the number of words contain- 
ing the stem "oppos" (returning the majority class 
if the difference is 0). No better than 62.67% test- 
set accuracy is obtained by either baseline. 
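The lexical baseline is simple enough to sketch directly. The tokenization and the substring test for "containing the stem" are assumptions of this illustration, and the majority class is passed in as a parameter because it depends on the data.

```python
import re


def stem_count_baseline(text, majority_class="yea"):
    """Classify by #("support") - #("oppos"), falling back to the majority class.

    A word counts toward a stem if it contains that stem as a substring,
    e.g. "supporters" or "opposition" (an assumption of this sketch).
    """
    words = re.findall(r"[a-z]+", text.lower())
    support = sum("support" in w for w in words)
    oppose = sum("oppos" in w for w in words)
    difference = support - oppose
    if difference > 0:
        return "yea"
    if difference < 0:
        return "nay"
    return majority_class


print(stem_count_baseline("I strongly support this bill and its supporters."))  # yea
print(stem_count_baseline("I rise in opposition; I oppose the bill."))          # nay
```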



Support/oppose classifier                                Devel.   Test
("speech segment ⇒ yea?")                                set      set

majority baseline                                        54.09    58.37
#("support") - #("oppos")                                59.14    62.67

SVM [speech segment]                                     70.04    66.05
SVM + same-speaker links                                 79.77    67.21
SVM + same-speaker links ...
    + agreement links, $\theta_{\mathrm{agr}} = 0$       89.11    70.81
    + agreement links, $\theta_{\mathrm{agr}} = \mu$     87.94    71.16

Table 4: Segment-based speech-segment classification accuracy, in
percent.



Support/oppose classifier                        Devel.   Test
("speech segment ⇒ yea?")                        set      set

SVM [speaker]                                    71.60    70.00
SVM + agreement links ...
    with $\theta_{\mathrm{agr}} = 0$             88.72    71.28
    with $\theta_{\mathrm{agr}} = \mu$           84.44    76.05

Table 5: Speaker-based speech-segment classification accuracy, in
percent. Here, the initial SVM is run on the concatenation of all of a
given speaker's speech segments, but the results are computed over
speech segments (not speakers), so that they can be compared to those
in Table 4.

^11 Unfortunately, this policy leads to inferior test-set agreement
classification. Section 4.5 contains further discussion.

^12 We elected not to explicitly tune the value of
$\theta_{\mathrm{agr}}$ in order to minimize the number of free
parameters to deal with.



Using relationship information Applying an SVM to classify each speech
segment in isolation leads to clear improvements over the two baseline
methods, as demonstrated in Table 4. When we impose the constraint that
all speech segments uttered by the same speaker receive the same label
via "same-speaker links", both test-set and development-set accuracy
increase even more, in the latter case quite substantially so.

The last two lines of Table 4 show that the best results are obtained
by incorporating agreement information as well. The highest test-set
result, 71.16%, is obtained by using a high-precision threshold to
determine which agreement links to add. While the development-set
results would induce us to utilize the standard threshold value of 0,
which is sub-optimal on the test set, the $\theta_{\mathrm{agr}} = 0$
agreement-link policy still achieves noticeable improvement over not
using agreement links (test set: 70.81% vs. 67.21%).



4.3 Speaker-based speech-segment 
classification 

We use speech segments as the unit of classifica- 
tion because they represent natural discourse units. 
As a consequence, we are able to exploit relation- 
ships at the speech-segment level. However, it is 
interesting to consider whether we really need to 
consider relationships specifically between speech 
segments themselves, or whether it suffices to sim- 
ply consider relationships between the speakers 
of the speech segments. In particular, as an al- 
ternative to using same-speaker links, we tried a 
speaker-based approach wherein the way we de- 
termine the initial individual-document classifica- 
tion score for each speech segment uttered by a 
person p in a given debate is to run an SVM on the 
concatenation of all of p's speech segments within 
that debate. (We also ensure that agreement-link 
information is propagated from speech-segment to 
speaker pairs.) 
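The speaker-based alternative can be sketched as follows. The `(segment_id, speaker, text)` input format and the pluggable `score_text` callable, which stands in for the trained SVM, are assumptions made for illustration.

```python
from collections import defaultdict


def concatenate_by_speaker(segments):
    """Join all of each speaker's segments within ONE debate into a single document.

    segments: list of (segment_id, speaker, text) in debate order (hypothetical format).
    """
    parts = defaultdict(list)
    for _, speaker, text in segments:
        parts[speaker].append(text)
    return {speaker: " ".join(texts) for speaker, texts in parts.items()}


def segment_scores_from_speakers(segments, score_text):
    """Score each speaker's concatenated text once, then copy it to every segment.

    score_text: any callable mapping a document string to a classifier score;
                it is a placeholder for the individual-document classifier.
    """
    documents = concatenate_by_speaker(segments)
    speaker_score = {spk: score_text(doc) for spk, doc in documents.items()}
    return {seg_id: speaker_score[spk] for seg_id, spk, _ in segments}


debate = [("s1", "A", "good bill"), ("s2", "B", "bad"), ("s3", "A", "vote yes")]
# Toy scorer (word count), used only to show how scores propagate.
print(segment_scores_from_speakers(debate, lambda doc: len(doc.split())))
```

Because every segment inherits its speaker's score, results can still be computed per speech segment, which keeps them comparable with the segment-based setting.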

How does the use of same-speaker links compare to the concatenation of
each speaker's speech segments? Tables 4 and 5 show that, not
surprisingly, the SVM individual-document classifier works better on
the concatenated speech segments than on the speech segments in
isolation. However, the effect on overall classification accuracy is
less clear: the development set favors same-speaker links over
concatenation, while the test set does not.

But we stress that the most important observation we can make from
Table 5 is that once again, the addition of agreement information leads
to substantial improvements in accuracy.

4.4 "Hard" agreement constraints 

Recall that in our experiments, we created finite-weight agreement
links, so that speech segments appearing in pairs flagged by our
(imperfect) agreement detector can potentially receive different
labels. We also experimented with forcing such speech segments to
receive the same label, either through infinite-weight agreement links
or through a speech-segment concatenation strategy similar to that
described in the previous subsection. Both strategies resulted in clear
degradation in performance on both the development and test sets, a
finding that validates our encoding of agreement information as "soft"
preferences.



4.5 On the development/test set split 

We have seen several cases in which the method 
that performs best on the development set does 
not yield the best test-set performance. However, 
we felt that it would be illegitimate to change the 
train/development/test sets in a post hoc fashion, 
that is, after seeing the experimental results. 

Moreover, and crucially, it is very clear that 
using agreement information, encoded as prefer- 
ences within our graph-based approach rather than 
as hard constraints, yields substantial improve- 
ments on both the development and test set; this, 
we believe, is our most important finding. 

5 Related work 

Politically-oriented text Sentiment analysis has specifically been
proposed as a key enabling technology in eRulemaking, allowing the
automatic analysis of the opinions that people submit (Shulman et al.,
2005; Cardie et al., 2006; Kwon, Shulman, and Hovy, 2006). There has
also been work focused upon determining the political leaning (e.g.,
"liberal" vs. "conservative") of a document or author, where most
previously-proposed methods make no direct use of relationships between
the documents to be classified (the "unlabeled" texts) (Laver, Benoit,
and Garry, 2003; Efron, 2004; Mullen and Malouf, 2006). An exception is
Grefenstette et al. (2004), who experimented with determining the
political orientation of websites essentially by classifying the
concatenation of all the documents found on that site.

Others have applied the NLP technologies of near-duplicate detection
and topic-based text categorization to politically oriented text (Yang
and Callan, 2005; Purpura and Hillard, 2006).

Detecting agreement We used a simple method to learn to identify
cross-speaker references indicating agreement. More sophisticated
approaches have been proposed (Hillard, Ostendorf, and Shriberg, 2003),
including an extension that, in an interesting reversal of our problem,
makes use of sentiment-polarity indicators within speech segments
(Galley et al., 2004). Also relevant is work on the general problems of
dialog-act tagging (Stolcke et al., 2000), citation analysis (Lehnert,
Cardie, and Riloff, 1990), and computational rhetorical analysis
(Marcu, 2000; Teufel and Moens, 2002).

We currently do not have an efficient means to encode disagreement
information as hard constraints; we plan to investigate incorporating
such information in future work.

Relationships between the unlabeled items Carvalho and Cohen (2005)
consider sequential relations between different types of emails (e.g.,
between requests and satisfactions thereof) to classify messages, and
thus also explicitly exploit the structure of conversations.

Previous sentiment-analysis work in different domains has considered
inter-document similarity (Agarwal and Bhattacharyya, 2005; Pang and
Lee, 2005; Goldberg and Zhu, 2006) or explicit inter-document
references in the form of hyperlinks (Agrawal et al., 2003).

Notable early papers on graph-based semi-
supervised learning include Blum and Chawla
(2001), Bansal, Blum, and Chawla (2002), Kondor
and Lafferty (2002), and Joachims (2003). Zhu
(2005) maintains a survey of this area.

Recently, several alternative, often quite so- 
phisticated approaches to collective classification 
have been proposed (Neville and Jensen, 2000;
Lafferty, McCallum, and Pereira, 2001;
Getoor et al., 2002;
Taskar, Abbeel, and Koller, 2002;
Taskar, Guestrin, and Koller, 2003;
Taskar, Chatalbashev, and Koller, 2004;
McCallum and Wellner, 2004). It would be
interesting to investigate the application of such 
methods to our problem. However, we also be- 
lieve that our approach has important advantages, 
including conceptual simplicity and the fact that it 
is based on an underlying optimization problem 
that is provably and in practice easy to solve. 
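
As a rough illustration of why such an optimization problem is easy to solve, the following is a minimal sketch, assuming a minimum-cut formulation in the style of Blum and Chawla (2001) and Pang and Lee (2004). The function name min_cut_labels, the score dictionaries, and the toy numbers are all hypothetical and are not taken from our experiments; the max-flow routine is a plain Edmonds-Karp search, used here only because it is short.

```python
from collections import defaultdict, deque

def min_cut_labels(ind_yes, ind_no, links):
    """Label each item True ("support") or False ("oppose") via a minimum cut.

    ind_yes[x], ind_no[x]: hypothetical non-negative individual scores.
    links[(a, b)]: hypothetical non-negative weight for "a and b should agree".
    Minimizes: sum of ind_no over items labeled True
             + sum of ind_yes over items labeled False
             + sum of link weights over pairs given different labels.
    """
    S, T = "_source", "_sink"
    cap = defaultdict(lambda: defaultdict(float))  # residual capacities
    for x in ind_yes:
        cap[S][x] += ind_yes[x]   # cut when x lands on the sink side
        cap[x][T] += ind_no[x]    # cut when x lands on the source side
    for (a, b), w in links.items():
        cap[a][b] += w            # cut when a and b are separated
        cap[b][a] += w

    def search():
        """Breadth-first search in the residual graph; returns parent map."""
        parent = {S: None}
        queue = deque([S])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 1e-12 and v not in parent:
                    parent[v] = u
                    if v == T:
                        return parent
                    queue.append(v)
        return parent

    while True:
        parent = search()
        if T not in parent:
            break                 # no augmenting path left: flow is maximum
        # Find the bottleneck capacity along the path, then push that flow.
        flow, v = float("inf"), T
        while parent[v] is not None:
            flow = min(flow, cap[parent[v]][v])
            v = parent[v]
        v = T
        while parent[v] is not None:
            u = parent[v]
            cap[u][v] -= flow
            cap[v][u] += flow
            v = u

    # Items still reachable from the source form the "support" side of the cut.
    return {x: x in parent for x in ind_yes}

# Toy data (made up): b leans "oppose" in isolation, but is strongly tied to a.
ind_yes = {"a": 0.9, "b": 0.4, "c": 0.2}
ind_no = {"a": 0.1, "b": 0.6, "c": 0.8}
labels = min_cut_labels(ind_yes, ind_no, {("a", "b"): 1.0})
print(labels)
```

In this toy instance, item b would be labeled "oppose" if classified in isolation, but the agreement link to a makes it cheaper to label both "support"; removing the link restores the isolated decision.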

6 Conclusion and future work 

In this study, we focused on very general types 
of cross-document classification preferences, uti- 
lizing constraints based only on speaker identity 
and on direct textual references between state- 
ments. We showed that the integration of even 
very limited information regarding inter-document 
relationships can significantly increase the accu- 
racy of support/opposition classification. 

The simple constraints modeled in our study, 
however, represent just a small portion of the
rich network of relationships that connect state-
ments and speakers across the political universe 
and in the wider realm of opinionated social dis- 
course. One intriguing possibility is to take ad- 
vantage of (readily identifiable) information re- 
garding interpersonal relationships, making use of 
speaker/author affiliations, positions within a so- 
cial hierarchy, and so on. Or, we could even at- 
tempt to model relationships between topics or 
concepts, in a kind of extension of collaborative 
filtering. For example, perhaps we could infer that 
two speakers sharing a common opinion on evo- 
lutionary biologist Richard Dawkins (a.k.a. "Dar- 
win's rottweiler") will be likely to agree in a de- 
bate centered on Intelligent Design. While such 
functionality is well beyond the scope of our cur- 
rent study, we are optimistic that we can develop 
methods to exploit additional types of relation- 
ships in future work. 

Acknowledgments We thank Claire Cardie, Jon 
Kleinberg, Michael Macy, Andrew Myers, and the 
six anonymous EMNLP referees for valuable dis- 
cussions and comments. We also thank Reviewer 
1 for generously providing additional post hoc 
feedback, and the EMNLP chairs Eric Gaussier 
and Dan Jurafsky for facilitating the process (as 
well as for allowing authors an extra proceed- 
ings page. . .). This paper is based upon work 
supported in part by the National Science Foun- 
dation under grant no. IIS-0329064 and an Al- 
fred P. Sloan Research Fellowship. Any opinions, 
findings, and conclusions or recommendations ex- 
pressed are those of the authors and do not neces- 
sarily reflect the views or official policies, either 
expressed or implied, of any sponsoring institu- 
tions, the U.S. government, or any other entity. 

References 

[Agarwal and Bhattacharyya2005] Agarwal, Alekh and 
Pushpak Bhattacharyya. 2005. Sentiment analy- 
sis: A new approach for effective use of linguistic 
knowledge and exploiting similarities in a set of doc- 
uments to be classified. In Proceedings of the Inter- 
national Conference on Natural Language Process- 
ing (ICON). 

[Agrawal et al.2003] Agrawal, Rakesh, Sridhar Ra- 
jagopalan, Ramakrishnan Srikant, and Yirong Xu. 
2003. Mining newsgroups using networks arising 
from social behavior. In Proceedings of WWW, 
pages 529-535. 

[Bansal, Blum, and Chawla2002] Bansal, Nikhil, 
Avrim Blum, and Shuchi Chawla. 2002. Correla-
tion clustering. In Proceedings of the Symposium on
Foundations of Computer Science (FOCS), pages
238-247. Journal version in Machine Learning
Journal, special issue on theoretical advances in
data clustering, 56(1-3):89-113 (2004).

[Barzilay and Lapata2005] Barzilay, Regina and 
Mirella Lapata. 2005. Collective content selection 
for concept-to-text generation. In Proceedings of 
HLT/EMNLP, pages 331-338. 

[Blum and Chawla2001] Blum, Avrim and Shuchi 
Chawla. 2001. Learning from labeled and unla- 
beled data using graph mincuts. In Proceedings of 
ICML, pages 19-26. 

[Cardie et al.2006] Cardie, Claire, Cynthia Farina, 
Thomas Bruce, and Erica Wagner. 2006. Using 
natural language processing to improve eRulemak- 
ing. In Proceedings of Digital Government Re- 
search (dg.o). 

[Carvalho and Cohen2005] Carvalho, Vitor and 
William W. Cohen. 2005. On the collective classi- 
fication of email "speech acts". In Proceedings of 
SIGIR, pages 345-352. 

[Daelemans and Hoste2002] Daelemans, Walter and 
Veronique Hoste. 2002. Evaluation of machine 
learning methods for natural language processing 
tasks. In Proceedings of the Third International 
Conference on Language Resources and Evaluation 
(LREC), pages 755-760. 

[Das and Chen2001] Das, Sanjiv and Mike Chen. 
2001. Yahoo! for Amazon: Extracting market sen- 
timent from stock message boards. In Proceedings 
of the Asia Pacific Finance Association Annual Con- 
ference (APFA). 

[Dave, Lawrence, and Pennock2003] Dave, Kushal, 
Steve Lawrence, and David M. Pennock. 2003. 
Mining the peanut gallery: Opinion extraction 
and semantic classification of product reviews. In 
Proceedings of WWW, pages 519-528. 

[Efron2004] Efron, Miles. 2004. Cultural orientation: 
Classifying subjective documents by cociation [sic] 
analysis. In Proceedings of the AAAI Fall Sympo- 
sium on Style and Meaning in Language, Art, Music, 
and Design, pages 41-48.

[Esuli2006] Esuli, Andrea. 2006. Senti-
ment classification bibliography.
liinwww.ira.uka.de/bibliography/Misc/Sentiment.html.

[Galley et al.2004] Galley, Michel, Kathleen McKe- 
own, Julia Hirschberg, and Elizabeth Shriberg. 
2004. Identifying agreement and disagreement in 
conversational speech: Use of Bayesian networks to 
model pragmatic dependencies. In Proceedings of 
the 42nd ACL, pages 669-676. 

[Getoor et al.2002] Getoor, Lise, Nir Friedman, 
Daphne Koller, and Benjamin Taskar. 2002. Learn-
ing probabilistic models of relational structure.
Journal of Machine Learning Research, 3:679-707.
Special issue on the Eighteenth ICML.

[Goldberg and Zhu2006] Goldberg, Andrew B. and 
Jerry Zhu. 2006. Seeing stars when there aren't 
many stars: Graph-based semi-supervised learn- 
ing for sentiment categorization. In TextGraphs: 
HLT/NAACL Workshop on Graph-based Algorithms 
for Natural Language Processing. 

[Grefenstette et al.2004] Grefenstette, Gregory, Yan 
Qu, James G. Shanahan, and David A. Evans. 2004. 
Coupling niche browsers and affect analysis for
an opinion mining application. In Proceedings of 
RIAO. 

[Hearst1992] Hearst, Marti. 1992. Direction-based
text interpretation as an information access refine- 
ment. In Paul Jacobs, editor, Text-Based Intelligent 
Systems. Lawrence Erlbaum Associates, pages 257- 
274. 

[Hillard, Ostendorf, and Shriberg2003] Hillard, Dustin, 
Mari Ostendorf, and Elizabeth Shriberg. 2003. De-
tection of agreement vs. disagreement in meetings: 
Training with unlabeled data. In Proceedings of 
HLT-NAACL. 

[Joachims2003] Joachims, Thorsten. 2003. Transduc- 
tive learning via spectral graph partitioning. In Pro- 
ceedings of ICML, pages 290-297. 

[Kondor and Lafferty2002] Kondor, Risi Imre and 
John D. Lafferty. 2002. Diffusion kernels on graphs 
and other discrete input spaces. In Proceedings of 
ICML, pages 315-322. 

[Kwon, Shulman, and Hovy2006] Kwon, Namhee, Stu- 
art Shulman, and Eduard Hovy. 2006. Multidimen- 
sional text analysis for eRulemaking. In Proceed- 
ings of Digital Government Research (dg.o). 

[Lafferty, McCallum, and Pereira2001] Lafferty, John, 
Andrew McCallum, and Fernando Pereira. 2001. 
Conditional random fields: Probabilistic models for 
segmenting and labeling sequence data. In Proceed- 
ings of ICML, pages 282-289. 

[Laver, Benoit, and Garry2003] Laver, Michael, Ken- 
neth Benoit, and John Garry. 2003. Extracting pol- 
icy positions from political texts using words as data. 
American Political Science Review. 

[Lehnert, Cardie, and Riloff1990] Lehnert, Wendy,
Claire Cardie, and Ellen Riloff. 1990. Analyz- 
ing research papers using citation sentences. In 
Program of the Twelfth Annual Conference of the 
Cognitive Science Society, pages 511-18. 

[Marcu2000] Marcu, Daniel. 2000. The theory and 
practice of discourse parsing and summarization. 
MIT Press. 

[McCallum and Wellner2004] McCallum, Andrew and 
Ben Wellner. 2004. Conditional models of identity 
uncertainty with application to noun coreference. In 
Proceedings of NIPS. 



[Mullen and Malouf2006] Mullen, Tony and Robert 
Malouf. 2006. A preliminary investigation into 
sentiment analysis of informal political discourse. 
In Proceedings of the AAAI Symposium on Com- 
putational Approaches to Analyzing Weblogs, pages 
159-162. 

[Munson, Cardie, and Caruana2005] Munson, Art, 
Claire Cardie, and Rich Caruana. 2005. Optimizing 
to arbitrary NLP metrics using ensemble selection. 
In Proceedings of HLT-EMNLP, pages 539-546.

[Neville and Jensen2000] Neville, Jennifer and David 
Jensen. 2000. Iterative classification in relational 
data. In Proceedings of the AAAI Workshop on 
Learning Statistical Models from Relational Data, 
pages 13-20. 

[Pang and Lee2004] Pang, Bo and Lillian Lee. 2004. 
A sentimental education: Sentiment analysis using 
subjectivity summarization based on minimum cuts. 
In Proceedings of the ACL, pages 271-278. 

[Pang and Lee2005] Pang, Bo and Lillian Lee. 2005. 
Seeing stars: Exploiting class relationships for sen- 
timent categorization with respect to rating scales. 
In Proceedings of the ACL. 

[Pang, Lee, and Vaithyanathan2002] Pang, Bo, Lillian 
Lee, and Shivakumar Vaithyanathan. 2002. Thumbs 
up? Sentiment classification using machine learning 
techniques. In Proceedings of EMNLP, pages 79- 
86. 

[Purpura and Hillard2006] Purpura, Stephen and 
Dustin Hillard. 2006. Automated classification of 
congressional legislation. In Proceedings of Digital 
Government Research (dg.o). 

[Sack1994] Sack, Warren. 1994. On the computation
of point of view. In Proceedings of AAAI, page
1488. Student abstract.

[Shulman et al.2005] Shulman, Stuart, Jamie Callan, 
Eduard Hovy, and Stephen Zavestoski. 2005. Lan- 
guage processing technologies for electronic rule- 
making: A project highlight. In Proceedings of Dig- 
ital Government Research (dg.o), pages 87-88. 

[Shulman and Schlosberg2002] Shulman, Stuart and 
David Schlosberg. 2002. Electronic rulemaking: 
New frontiers in public participation. Prepared for 
the Annual Meeting of the American Political Sci- 
ence Association. 

[Smith, Roberts, and Vander Wielen2005] Smith,
Steven S., Jason M. Roberts, and Ryan J. Vander
Wielen. 2005. The American Congress. Cambridge 
University Press, fourth edition. 

[Stolcke et al.2000] Stolcke, Andreas, Noah Coccaro,
Rebecca Bates, Paul Taylor, Carol Van Ess-Dykema,
Klaus Ries, Elizabeth Shriberg, Daniel Jurafsky,
Rachel Martin, and Marie Meteer. 2000. Dialogue
act modeling for automatic tagging and recognition 
of conversational speech. Computational Linguis- 
tics, 26(3):339-373. 



[Taskar, Abbeel, and Koller2002] Taskar, Ben, Pieter 
Abbeel, and Daphne Koller. 2002. Discriminative 
probabilistic models for relational data. In Proceed-
ings of UAI, Edmonton, Canada.

[Taskar, Chatalbashev, and Koller2004] Taskar, Ben, 
Vassil Chatalbashev, and Daphne Koller. 2004. 
Learning associative Markov networks. In Proceed-
ings of ICML.

[Taskar, Guestrin, and Koller2003] Taskar, Ben, Carlos 
Guestrin, and Daphne Koller. 2003. Max-margin
Markov networks. In Proceedings of NIPS. 

[Teufel and Moens2002] Teufel, Simone and Marc 
Moens. 2002. Summarizing scientific articles: 
Experiments with relevance and rhetorical status. 
Computational Linguistics, 28(4):409-445.

[Turney2002] Turney, Peter. 2002. Thumbs up or 
thumbs down? Semantic orientation applied to un- 
supervised classification of reviews. In Proceedings 
of the ACL, pages 417-424. 

[Wiebe1994] Wiebe, Janyce M. 1994. Tracking point
of view in narrative. Computational Linguistics, 
20(2):233-287. 

[Wiebe and Rapaport1988] Wiebe, Janyce M. and
William J. Rapaport. 1988. A computational 
theory of perspective and reference in narrative. In 
Proceedings of the ACL, pages 131-138. 

[Yang and Callan2005] Yang, Hui and Jamie Callan. 
2005. Near-duplicate detection for eRulemaking. In 
Proceedings of Digital Government Research (dg.o). 

[Zhu2005] Zhu, Jerry. 2005. Semi-supervised 
learning literature survey. Computer Sci- 
ences Technical Report TR 1530, Univer- 
sity of Wisconsin-Madison. Available at 
http://www.cs.wisc.edu/~jerryzhu/pub/ssl_survey.pdf;
has been updated since the initial 2005 version. 



